Part 1
Part I
Lesson 1
Learning the mechanics of wrangling data, particularly using the mutate() and summarise() functions efficiently. Two datasets are used throughout: penguins and ames.
The dimensions of the penguins dataset: 344 observations (rows) and 8 columns or variables.
Code
dim(penguins)
[1] 344   8
To get a good overview of the dataset, we can use the skim() function from the skimr package. It provides summary statistics and reports the proportion of missing cases.
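A minimal sketch of the call that would produce the summary below (assuming the skimr and palmerpenguins packages are installed):

```r
library(skimr)
library(palmerpenguins)

# One call gives row/column counts, type frequencies, and
# per-variable summary statistics including missingness
skim(penguins)
```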
| Name | palmerpenguins::penguins |
| Number of rows | 344 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
| island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
| sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
| year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
An alternative to skimr for descriptive statistics is the gt_plt_summary() function from the gtExtras package. For it to work, you need to select the variables in advance. Note:
Code
select(penguins, bill_length_mm, bill_depth_mm) |>
  gt_plt_summary()

344 rows x 2 cols

| Column | Plot Overview | Missing | Mean | Median | SD |
|---|---|---|---|---|---|
| bill_length_mm | | 0.6% | 43.9 | 44.5 | 5.5 |
| bill_depth_mm | | 0.6% | 17.2 | 17.3 | 2.0 |
If you want to focus on a few variables, you can pass them to the skim() function. For instance, examining the properties of Lot_Frontage and Lot_Area in the ames data:
| Name | modeldata::ames |
| Number of rows | 2930 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Lot_Frontage | 0 | 1 | 57.65 | 33.50 | 0 | 43.00 | 63.0 | 78.00 | 313 | ▇▇▁▁▁ |
| Lot_Area | 0 | 1 | 10147.92 | 7880.02 | 1300 | 7440.25 | 9436.5 | 11555.25 | 215245 | ▇▁▁▁▁ |
Notice the ames data does not have tidy variable names. It is a good idea to use janitor::clean_names() to convert them to lowercase snake case.
Code
modeldata::ames |>
  janitor::clean_names() |>
  skimr::skim(lot_frontage, lot_area)

| Name | janitor::clean_names(mode… |
| Number of rows | 2930 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| lot_frontage | 0 | 1 | 57.65 | 33.50 | 0 | 43.00 | 63.0 | 78.00 | 313 | ▇▇▁▁▁ |
| lot_area | 0 | 1 | 10147.92 | 7880.02 | 1300 | 7440.25 | 9436.5 | 11555.25 | 215245 | ▇▁▁▁▁ |
Lesson 2
The focus is the mutate() function, and in particular vectorized vs. one-by-one (iterative) calculations inside mutate(). The lesson introduces the map() function, which applies a function to each element of a vector one at a time instead of operating on the whole vector at once. The variations of map() discussed include map_lgl() (map logical) and map_dbl() (map double). The first produces a logical vector; the second produces numeric values, here 1 (TRUE) and 0 (FALSE).
Handling missing cases
It is important to handle missing values before using mutate(). We can verify that the missing cases were removed using skimr.
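The penguins_wo_NAs object summarized below is not constructed in these notes; a plausible sketch, assuming tidyr's drop_na(), is:

```r
library(tidyverse)
library(palmerpenguins)

# Keep only complete rows: 344 observations drop to 333
penguins_wo_NAs <- penguins |>
  drop_na()
```

The skim output that follows is consistent with this: 333 rows and no remaining missing values.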
Code
| Name | penguins_wo_NAs |
| Number of rows | 333 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1 | FALSE | 3 | Ade: 146, Gen: 119, Chi: 68 |
| island | 0 | 1 | FALSE | 3 | Bis: 163, Dre: 123, Tor: 47 |
| sex | 0 | 1 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 0 | 1 | 43.99 | 5.47 | 32.1 | 39.5 | 44.5 | 48.6 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 0 | 1 | 17.16 | 1.97 | 13.1 | 15.6 | 17.3 | 18.7 | 21.5 | ▅▆▇▇▂ |
| flipper_length_mm | 0 | 1 | 200.97 | 14.02 | 172.0 | 190.0 | 197.0 | 213.0 | 231.0 | ▂▇▃▅▃ |
| body_mass_g | 0 | 1 | 4207.06 | 805.22 | 2700.0 | 3550.0 | 4050.0 | 4775.0 | 6300.0 | ▃▇▅▃▂ |
| year | 0 | 1 | 2008.04 | 0.81 | 2007.0 | 2007.0 | 2008.0 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
Handling vectors
mutate() is a vectorised function. This means that it takes the two columns listed inside mutate() as whole vectors: the values of each column are not processed one by one. The values are handled as a vector, and the two vectors are then operated on at once via a vectorized division (/).
Imagine these two vectors are “smashed together” and the result is divided.
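A sketch of the mutate() call behind the output below; judging from the printed values, the ratio divides bill length by bill depth (the column names are taken from the output):

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  tidyr::drop_na() |>
  # vectorized: both columns are divided element-wise in one operation
  mutate(bill_flipper_ratio = bill_length_mm / bill_depth_mm) |>
  select(bill_length_mm, bill_depth_mm, bill_flipper_ratio)
```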
Code
# A tibble: 333 × 3
bill_length_mm bill_depth_mm bill_flipper_ratio
<dbl> <dbl> <dbl>
1 39.1 18.7 2.09
2 39.5 17.4 2.27
3 40.3 18 2.24
4 36.7 19.3 1.90
5 39.3 20.6 1.91
6 38.9 17.8 2.19
7 39.2 19.6 2
8 41.1 17.6 2.34
9 38.6 21.2 1.82
10 34.6 21.1 1.64
# ℹ 323 more rows
Here is an example where vectorization breaks down, illustrated with a custom function. Let's apply the function to one of the vectors used in computing bill_flipper_ratio, in this case bill_length_mm.
Notice that applying the function prompts an error message. It traces back to if (x > 39) ...: the condition has length > 1, because if () expects a single logical value, not a whole vector.
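A hypothetical version of the custom function that reproduces this behavior (the name large_quantity and the threshold 39 are taken from the lesson; the exact body is an assumption):

```r
# if () expects a single TRUE/FALSE, not a vector
large_quantity <- function(x) {
  if (x > 39) TRUE else FALSE
}

large_quantity(40.3)   # fine: one value at a time
# large_quantity(c(39.1, 36.7))
# Error in if (x > 39) ... : the condition has length > 1
```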
Functional programming or map()
map_lgl() enforces the one-by-one calculation instead of the whole-vector calculation.
To address this limitation, we turn to functional programming: map() performs the calculation element by element. Inside mutate(), the operation is carried out by calling the large_quantity() function through map().
In other words, instead of the vector-smashing approach, we handle each element of the column listed in mutate() individually.
We rely on the map() family, in this case the logical variant map_lgl(), and instruct R to apply the large_quantity() function to each element of the column bill_length_mm, one by one. Notice that map_lgl() lists the vector or column first, and then the function to be applied to it: it applies the function to each element of the vector instead of operating on the vector as a whole.
“The map functions transform their input by applying a function to each element of a list or atomic vector and returning an object of the same length as the input.”
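A sketch of the map_lgl() call that would produce the output below (it assumes the large_quantity() helper with the threshold of 39 mentioned earlier):

```r
library(dplyr)
library(purrr)

large_quantity <- function(x) if (x > 39) TRUE else FALSE

palmerpenguins::penguins |>
  tidyr::drop_na() |>
  # map_lgl() feeds bill_length_mm to large_quantity() element by element
  mutate(large_bill_length = map_lgl(bill_length_mm, large_quantity)) |>
  select(bill_length_mm, large_bill_length)
```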
Code
# A tibble: 333 × 2
bill_length_mm large_bill_length
<dbl> <lgl>
1 39.1 TRUE
2 39.5 TRUE
3 40.3 TRUE
4 36.7 FALSE
5 39.3 TRUE
6 38.9 FALSE
7 39.2 TRUE
8 41.1 TRUE
9 38.6 FALSE
10 34.6 FALSE
# ℹ 323 more rows
Another variation is map() itself. Instead of producing an atomic vector, map() produces a list, so the result is stored as a list column in which each element is a length-one logical vector.
Code
# A tibble: 333 × 2
bill_length_mm large_bill_length
<dbl> <list>
1 39.1 <lgl [1]>
2 39.5 <lgl [1]>
3 40.3 <lgl [1]>
4 36.7 <lgl [1]>
5 39.3 <lgl [1]>
6 38.9 <lgl [1]>
7 39.2 <lgl [1]>
8 41.1 <lgl [1]>
9 38.6 <lgl [1]>
10 34.6 <lgl [1]>
# ℹ 323 more rows
Still another variation of the map() family is map_dbl(). It reports numbers instead of logical values. Notice that, again, we list the column to be processed first, followed by the function (large_quantity).
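A sketch of the map_dbl() variant; map_dbl() coerces the function's TRUE/FALSE result into 1/0 (same assumed large_quantity() helper as before):

```r
library(dplyr)
library(purrr)

large_quantity <- function(x) if (x > 39) TRUE else FALSE

palmerpenguins::penguins |>
  tidyr::drop_na() |>
  # map_dbl() coerces each logical result to a double: TRUE -> 1, FALSE -> 0
  mutate(large_bill_length = map_dbl(bill_length_mm, large_quantity)) |>
  select(bill_length_mm, large_bill_length)
```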
Code
# A tibble: 333 × 2
bill_length_mm large_bill_length
<dbl> <dbl>
1 39.1 1
2 39.5 1
3 40.3 1
4 36.7 0
5 39.3 1
6 38.9 0
7 39.2 1
8 41.1 1
9 38.6 0
10 34.6 0
# ℹ 323 more rows
Lesson 3
This section covers the summarise() function, a counterpart to mutate(). It also works with vectors, but the difference is the amount of output: while mutate() produces vectors as long as the data, summarise() returns single elements, or atomic values.
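A minimal sketch of the summarise() call behind the one-row output below (column names inferred from the output; penguins_wo_NAs rebuilt here so the chunk is self-contained):

```r
library(dplyr)

penguins_wo_NAs <- tidyr::drop_na(palmerpenguins::penguins)

# summarise() collapses each expression to a single value
penguins_wo_NAs |>
  summarise(
    mean_flipper_length = mean(flipper_length_mm),
    sd_flipper_length   = sd(flipper_length_mm)
  )
```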
Code
# A tibble: 1 × 2
mean_flipper_length sd_flipper_length
<dbl> <dbl>
1 201. 14.0
You can also save the lengths into one single list using the list() function. In other words, the last column, labeled flipper_lengths, contains a vector of 333 values corresponding to the 333 penguins' flipper lengths.
Code
# A tibble: 1 × 3
mean_flipper_length sd_flipper_length flipper_lengths
<dbl> <dbl> <list>
1 201. 14.0 <int [333]>
Lesson 4
Handling missing values in common calculations. If not handled, the summarise() function generates NAs. One option is to use na.rm = TRUE.
Code
# A tibble: 1 × 3
mean_flipper_length sd_flipper_length flipper_lengths
<dbl> <dbl> <list>
1 201. 14.1 <int [344]>
Lesson 5
Repeated calculations for subsets of data. There are two options to automate repeated calculations: a for() loop, or grouped calculations. The examples below use the data without missing values.
One option is the .by = argument to specify subgroups. Be careful about omitting the "." — it would still produce output, but without summarizing the results by species.
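A sketch of the grouped call, using the same summaries as before with .by = species added:

```r
library(dplyr)

penguins_wo_NAs <- tidyr::drop_na(palmerpenguins::penguins)

# .by = species repeats the calculation once per species
penguins_wo_NAs |>
  summarise(
    mean_flipper_length = mean(flipper_length_mm),
    sd_flipper_length   = sd(flipper_length_mm),
    flipper_lengths     = list(flipper_length_mm),
    .by = species
  )
```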
Code
# A tibble: 3 × 4
species mean_flipper_length sd_flipper_length flipper_lengths
<fct> <dbl> <dbl> <list>
1 Adelie 190. 6.52 <int [146]>
2 Gentoo 217. 6.59 <int [119]>
3 Chinstrap 196. 7.13 <int [68]>
A warning regarding grouping in a chained series of computations, as shown below: group_by() changes the underlying structure of the data, which stays grouped by the leading variable(s) listed in group_by().
Notice the intermediate output indicates # Groups: species [3]: the 5 species-island combinations are still grouped by the 3 species. Adding another summarise() then creates a problem. Suppose we want the mean of the mean_flipper_length estimated in the previous step, as well as the sd of the previously created sd_flipper_length. The end result is an sd_flipper_length column with two NAs.
During the first calculation, summarise() strips only the last group listed in group_by(): the variable island. The result is a 5 x 5 tibble organized into three groups corresponding to the 3 species. For a species observed on a single island, the second summarise() then takes the sd of a single value, which is NA. This invites mistakes if one ignores the structure imposed on the data by group_by(): grouped calculations behave differently.
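A sketch of the two-step pipeline that produces those NAs (the pitfall is that the intermediate result is still grouped by species):

```r
library(dplyr)

penguins_wo_NAs <- tidyr::drop_na(palmerpenguins::penguins)

penguins_wo_NAs |>
  group_by(species, island) |>
  summarise(
    mean_flipper_length = mean(flipper_length_mm),
    sd_flipper_length   = sd(flipper_length_mm)
  ) |>                                              # still grouped by species!
  summarise(
    mean_flipper_length = mean(mean_flipper_length),
    sd_flipper_length   = sd(sd_flipper_length)     # sd of a single value -> NA
  )
```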
Code
# A tibble: 3 × 3
species mean_flipper_length sd_flipper_length
<fct> <dbl> <dbl>
1 Adelie 190. 0.255
2 Chinstrap 196. NA
3 Gentoo 217. NA
To remove the underlying grouping structure, add the option .groups = "drop" to the summarise() call. This restores an ungrouped 5 x 5 tibble.
Code
# A tibble: 5 × 5
species island mean_flipper_length sd_flipper_length flipper_lengths
<fct> <fct> <dbl> <dbl> <list>
1 Adelie Biscoe 189. 6.73 <int [44]>
2 Adelie Dream 190. 6.48 <int [55]>
3 Adelie Torgersen 192. 6.22 <int [47]>
4 Chinstrap Dream 196. 7.13 <int [68]>
5 Gentoo Biscoe 217. 6.59 <int [119]>
Another alternative to .groups = "drop" is the argument .by = c(species, island), which also returns an ungrouped result. Notice that in this case .by = replaces the group_by() call.
Code
# A tibble: 5 × 5
species island mean_flipper_length sd_flipper_length flipper_lengths
<fct> <fct> <dbl> <dbl> <list>
1 Adelie Torgersen 192. 6.22 <int [47]>
2 Adelie Biscoe 189. 6.73 <int [44]>
3 Adelie Dream 190. 6.48 <int [55]>
4 Gentoo Biscoe 217. 6.59 <int [119]>
5 Chinstrap Dream 196. 7.13 <int [68]>
Lesson 6
Unique function
Using the .by = argument with mutate() produces a long column of group means, in this case means by species. Notice that the printed tibble only shows the first rows, so only one mean value is visible on screen. To see the means of the other species, you could create a column, say test, and print it.
Code
# A tibble: 344 × 3
bill_length_mm flipper_length_mm mean_column
<dbl> <int> <dbl>
1 39.1 181 38.8
2 39.5 186 38.8
3 40.3 195 38.8
4 NA NA 38.8
5 36.7 193 38.8
6 39.3 190 38.8
7 38.9 181 38.8
8 39.2 195 38.8
9 34.1 193 38.8
10 42 190 38.8
# ℹ 334 more rows
To make sense of the listing, let's select just the species column and the mean column.
Code
# A tibble: 344 × 2
species mean_column
<fct> <dbl>
1 Adelie 38.8
2 Adelie 38.8
3 Adelie 38.8
4 Adelie 38.8
5 Adelie 38.8
6 Adelie 38.8
7 Adelie 38.8
8 Adelie 38.8
9 Adelie 38.8
10 Adelie 38.8
# ℹ 334 more rows
To avoid a long list of 344 means by species, use the unique() function, which eliminates duplicates. There were many duplicated means within each species, and unique() collapses them, reporting one row per species.
The .by = species argument forces the calculation to take place separately for each level of the referenced column, here the 3 penguin species. unique(), on the other hand, removes the duplicated rows within each species.
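A sketch of the full pipeline, collapsing the 344 repeated means down to one row per species:

```r
library(dplyr)
library(palmerpenguins)

penguins |>
  mutate(mean_column = mean(bill_length_mm, na.rm = TRUE),
         .by = species) |>
  select(species, mean_column) |>
  unique()                       # 344 rows -> 3 rows, one per species
```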
Centering a variable by its mean
Centering a variable is an ideal situation for using the .by = argument inside mutate() rather than summarise(): a situation for using the scale() function instead of reporting the means for each group as we did with mutate(). "scale is a generic R function whose default method centers and/or scales the columns of a numeric matrix."
Notice we are centering, or scaling, each penguin species' bill length around its own mean, and adding transparency by requesting alpha = 0.5.
Centering helps display the distributions on a common scale. It is also a popular technique in machine learning. Here it locates the three species' distributions along a common axis, allowing comparisons across species, or group scaling.
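The centered version sketched below mirrors the commented-out lines shown in the later code chunk: scale() is applied within each species via .by =, so all three densities share a zero center:

```r
library(dplyr)
library(ggplot2)

palmerpenguins::penguins |>
  mutate(bill_length_mm = scale(bill_length_mm),
         .by = species) |>            # center each species at its own mean
  ggplot() +
  geom_density(aes(x = bill_length_mm, fill = species),
               alpha = 0.5)
```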
Code
Without centering, you can see that the three distributions have different locations. The lengths vary between roughly 30 and 60, and each species sits in a different part of that range.
Code
palmerpenguins::penguins |>
  # mutate(
  #   bill_length_mm = scale(bill_length_mm),
  #   .by = species) |>
  ggplot() +
  geom_density(aes(x = bill_length_mm, fill = species),
               alpha = 0.5)

Another option is to center across all penguins at once instead of within each species. As shown below, that produces a graph without a common center, which is why, if you want to make group comparisons, you need the .by option: it gives every species a common center, or zero point.
Lesson 7
Missing data with group calculations
Notice that not all combinations of species and island are displayed. For instance, the Chinstrap-Biscoe, Chinstrap-Torgersen, Gentoo-Dream, and Gentoo-Torgersen combinations are absent.
Code
# A tibble: 5 × 5
species island mean_flipper_length sd_flipper_length flipper_lengths
<fct> <fct> <dbl> <dbl> <list>
1 Adelie Torgersen 191. 6.23 <int [52]>
2 Adelie Biscoe 189. 6.73 <int [44]>
3 Adelie Dream 190. 6.59 <int [56]>
4 Gentoo Biscoe 217. 6.48 <int [124]>
5 Chinstrap Dream 196. 7.13 <int [68]>
To display all combinations of species and island, even those with no observations, we can use the complete() function.
complete() displays all combinations among the listed variables, even where data are missing. Its fill argument also lets you replace the default NA with another value, say 0.
Code
penguins_wo_NAs |>
  summarise(
    mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
    sd_flipper_length = sd(flipper_length_mm, na.rm = TRUE),
    flipper_lengths = list(flipper_length_mm),
    .by = c(species, island)) |>
  complete(
    species, island,
    fill = list(mean_flipper_length = 0,
                sd_flipper_length = 0)
  )

# A tibble: 9 × 5
species island mean_flipper_length sd_flipper_length flipper_lengths
<fct> <fct> <dbl> <dbl> <list>
1 Adelie Biscoe 189. 6.73 <int [44]>
2 Adelie Dream 190. 6.48 <int [55]>
3 Adelie Torgersen 192. 6.22 <int [47]>
4 Chinstrap Biscoe 0 0 <NULL>
5 Chinstrap Dream 196. 7.13 <int [68]>
6 Chinstrap Torgersen 0 0 <NULL>
7 Gentoo Biscoe 217. 6.59 <int [119]>
8 Gentoo Dream 0 0 <NULL>
9 Gentoo Torgersen 0 0 <NULL>
Lesson 8
Using across() + summarise() instead of repetition.
Original mutate
Code
# A tibble: 1 × 4
mean_bill_depth mean_bill_length mean_flipper_length mean_body_mass
<dbl> <dbl> <dbl> <dbl>
1 17.2 44.0 201. 4207.
Using the across() function; several variations follow.
We also use a glue specification to affix a prefix to the output names, via .names = 'mean_{.col}'. This glue string prepends mean_ to every column referenced in .cols = c(...).
By default the output would not display the mean_ prefix; to get it, you need the .names argument, which takes a glue specification.
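A sketch of the across() call matching the output below (the column list is inferred from the output names):

```r
library(dplyr)

penguins_wo_NAs <- tidyr::drop_na(palmerpenguins::penguins)

penguins_wo_NAs |>
  summarise(across(
    .cols  = c(bill_length_mm, bill_depth_mm,
               flipper_length_mm, body_mass_g),
    .fns   = mean,
    .names = "mean_{.col}"      # glue spec: prepend mean_ to each name
  ))
```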
Code
# A tibble: 1 × 4
mean_bill_length_mm mean_bill_depth_mm mean_flipper_length_mm mean_body_mass_g
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
Code
# A tibble: 1 × 4
mean_bill_length_mm mean_bill_depth_mm mean_flipper_length_mm mean_body_mass_g
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
Code
# A tibble: 1 × 5
mean_bill_length_mm mean_bill_depth_mm mean_flipper_length_mm mean_body_mass_g
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
# ℹ 1 more variable: mean_year <dbl>
Code
# A tibble: 1 × 2
mean_bill_length_mm mean_bill_depth_mm
<dbl> <dbl>
1 44.0 17.2
Code
# Selecting columns with specific names
penguins_wo_NAs |>
  summarise(across(starts_with('flipper'),
                   function(x) mean(x, na.rm = TRUE),
                   .names = 'mean_{.col}'))

# A tibble: 1 × 1
mean_flipper_length_mm
<dbl>
1 201.
Removing unwanted strings
Note that we can also remove unwanted suffixes such as _mm and _g from the output names.
Code
# A tibble: 1 × 4
mean_bill_length mean_bill_depth mean_flipper_length mean_body_mass
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
Another option is to include {.fn} in the .names specification. Unfortunately, on its own it does not do the trick: the prefix mean_ disappears and a prefix 1_ is added instead, because an unnamed function is referenced by its position.
Code
# A tibble: 1 × 4
`1_bill_length` `1_bill_depth` `1_flipper_length` `1_body_mass`
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
To make it work correctly, we need .fns = list(mean = mean), which attaches a name to the function. This is the best option for displaying the name of the requested summary statistic.
Code
# A tibble: 1 × 4
avg_bill_length avg_bill_depth avg_flipper_length avg_body_mass
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
You can compute more summary statistics by also estimating the standard deviation.
Code
# A tibble: 1 × 8
avg_bill_length standard_deviation_bil…¹ avg_bill_depth standard_deviation_b…²
<dbl> <dbl> <dbl> <dbl>
1 44.0 5.47 17.2 1.97
# ℹ abbreviated names: ¹standard_deviation_bill_length,
# ²standard_deviation_bill_depth
# ℹ 4 more variables: avg_flipper_length <dbl>,
# standard_deviation_flipper_length <dbl>, avg_body_mass <dbl>,
# standard_deviation_body_mass <dbl>
Lesson 9
Using across() + mutate() instead of repetition
Using across() inside mutate() facilitates cleaning multiple columns in one shot. In this example, we use across() along with parse_number() to extract the first number inside each text value.
We may get a warning, but the numbers are parsed, or extracted, from the strings. The end result is a tibble with numbers instead of characters.
parse_number() extracts the first number from any string that has digits embedded in it.
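A self-contained sketch, with the toy tibble reconstructed from the printed rows below:

```r
library(dplyr)
library(readr)   # parse_number()
library(tibble)

tbl <- tibble(
  text1 = c("I have 10 apples", "I have 20 appples"),
  text2 = c("I have 30 apples", "I have 40 apples")
)

# Clean every column at once: characters become numbers
tbl |>
  mutate(across(everything(), parse_number))
```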
Code
# A tibble: 2 × 2
text1 text2
<dbl> <dbl>
1 10 30
2 20 40
If you want to keep both the numbers and the original text, add new columns instead:
Code
# A tibble: 2 × 4
text1 text2 number_text1 number_text2
<chr> <chr> <dbl> <dbl>
1 I have 10 apples I have 30 apples 10 30
2 I have 20 appples I have 40 apples 20 40
Lesson 10
Adding calculations next to across()
Recapping a prior example below. The operation listed inside summarise() is just one of several calculations that could be added. Below, str_remove() strips the matched pattern from all the output column names.
Code
# A tibble: 1 × 4
mean_bill_length mean_bill_depth mean_flipper_length mean_body_mass_g
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
Code
# A tibble: 1 × 4
Avg_bill_length Avg_bill_depth Avg_flipper_length Avg_body_mass_g
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
Adding more operations to summarise(): say you want the standard deviation, sd, of bill_length as well. All you need to do is add a line.
Code
# A tibble: 1 × 5
mean_bill_length mean_bill_depth mean_flipper_length mean_body_mass_g
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
# ℹ 1 more variable: sd_bill_length <dbl>
Code
# A tibble: 1 × 5
mean_bill_length mean_bill_depth mean_flipper_length mean_body_mass_g
<dbl> <dbl> <dbl> <dbl>
1 44.0 17.2 201. 4207.
# ℹ 1 more variable: sd_bill_length <dbl>
Lesson 11
Grouped calculations with across()
In the last lesson we learned that not only can we use across() for multiple computations, but we can also combine it with other arguments. This means we can do across() calculations on a group basis as well. In short, we can perform powerful computations concisely.
In the example below, we can add the estimation of means and standard deviations by species and island.
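A sketch combining across() with grouping (names inferred from the output; in the lesson's output the _mm suffixes were additionally stripped, which is omitted here):

```r
library(dplyr)

penguins_wo_NAs <- tidyr::drop_na(palmerpenguins::penguins)

penguins_wo_NAs |>
  summarise(
    across(bill_length_mm:body_mass_g,
           mean,
           .names = "mean_{.col}"),
    sd_bill_length = sd(bill_length_mm),   # extra stand-alone summary
    .by = c(species, island)
  )
```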
Code
# A tibble: 5 × 7
species island mean_bill_length mean_bill_depth mean_flipper_length
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.0 18.5 192.
2 Adelie Biscoe 39.0 18.4 189.
3 Adelie Dream 38.5 18.2 190.
4 Gentoo Biscoe 47.6 15.0 217.
5 Chinstrap Dream 48.8 18.4 196.
# ℹ 2 more variables: mean_body_mass_g <dbl>, sd_bill_length <dbl>
Code
# A tibble: 5 × 7
species island mean_bill_length mean_bill_depth mean_flipper_length
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.0 18.5 192.
2 Adelie Biscoe 39.0 18.4 189.
3 Adelie Dream 38.5 18.2 190.
4 Gentoo Biscoe 47.6 15.0 217.
5 Chinstrap Dream 48.8 18.4 196.
# ℹ 2 more variables: mean_body_mass_g <dbl>, sd_bill_length <dbl>
An extension of this idea also applies to .names =. See below.
Code
# A tibble: 5 × 11
species island avg_bill_length standard_deviation_bill_l…¹ avg_bill_depth
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.0 3.03 18.5
2 Adelie Biscoe 39.0 2.48 18.4
3 Adelie Dream 38.5 2.48 18.2
4 Gentoo Biscoe 47.6 3.11 15.0
5 Chinstrap Dream 48.8 3.34 18.4
# ℹ abbreviated name: ¹standard_deviation_bill_length
# ℹ 6 more variables: standard_deviation_bill_depth <dbl>,
# avg_flipper_length <dbl>, standard_deviation_flipper_length <dbl>,
# avg_body_mass <dbl>, standard_deviation_body_mass <dbl>,
# sd_bill_lenght <dbl>
Lesson 12
Getting to know the reframe() function
reframe() is a function sitting between mutate() and summarise(): it returns results of arbitrary length, from one row (as with summarise()) to many (as with mutate()).
The range() function is relevant here. It returns the minimum and maximum values of a column.
If we use range() within summarise(), we get an error message asking us to use reframe() instead.
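range() in base R, for reference:

```r
x <- c(40.3, 32.1, 59.6, 44.5)

# range() returns a length-2 vector: c(minimum, maximum)
range(x)
#> [1] 32.1 59.6
```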
Code
range(penguins_wo_NAs$bill_length_mm)
[1] 32.1 59.6
Code
# A tibble: 2 × 1
range_bill_lenght_mm
<dbl>
1 32.1
2 59.6
Since reframe() along with range() returns unlabeled rows, we can add a type column to attach labels to the output.
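A sketch of the reframe() call; the type column labels the two rows that range() returns (the output column's spelling, range_bill_lenght_mm, follows the lesson's output):

```r
library(dplyr)

penguins_wo_NAs <- tidyr::drop_na(palmerpenguins::penguins)

penguins_wo_NAs |>
  reframe(
    type = c("min", "max"),                       # row labels
    range_bill_lenght_mm = range(bill_length_mm)  # two values per group
  )
```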
Code
# A tibble: 2 × 2
type range_bill_lenght_mm
<chr> <dbl>
1 min 32.1
2 max 59.6
We can do more with reframe(); we can compute group statistics as well.
Code
# A tibble: 6 × 3
species type range_bill_lenght_mm
<fct> <chr> <dbl>
1 Adelie min 32.1
2 Adelie max 46
3 Gentoo min 40.9
4 Gentoo max 59.6
5 Chinstrap min 40.9
6 Chinstrap max 58
We can also add across() and request range() within .fns =. This approach inserts the across() call inside the reframe() call.
Code
# A tibble: 6 × 6
species type bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <chr> <dbl> <dbl> <int> <int>
1 Adelie min 32.1 15.5 172 2850
2 Adelie max 46 21.5 210 4775
3 Gentoo min 40.9 13.1 203 3950
4 Gentoo max 59.6 17.3 231 6300
5 Chinstrap min 40.9 16.4 178 2700
6 Chinstrap max 58 20.8 212 4800
But we can also format the result as a table. To do so, we call the pivot_longer() function and then the pivot_wider() function.
Code
#Pivot longer
penguins_wo_NAs |>
reframe(
type = c("min", "max"),
across(.cols = c("bill_length_mm",
"bill_depth_mm",
"flipper_length_mm",
"body_mass_g"),
.fns = range
),
.by = species) |>
pivot_longer(
cols = -c(1:2),
names_to = "quantity",
values_to = 'value'
) |>
pivot_wider(
id_cols = c(quantity),
names_from = c(species,type),
values_from = value
)# A tibble: 4 × 7
quantity Adelie_min Adelie_max Gentoo_min Gentoo_max Chinstrap_min
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bill_length_mm 32.1 46 40.9 59.6 40.9
2 bill_depth_mm 15.5 21.5 13.1 17.3 16.4
3 flipper_length_mm 172 210 203 231 178
4 body_mass_g 2850 4775 3950 6300 2700
# ℹ 1 more variable: Chinstrap_max <dbl>
Lesson 13
Motivation for tidyselect helpers: they are powerful techniques for selecting the right columns or variables. For instance, .fns = lets one attach labels to the computed statistics, and those labels are assembled into column names via the .names = specification (see below):
Code
# A tibble: 5 × 10
species island avg_bill_length standard_deviation_bill_l…¹ avg_bill_depth
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.0 3.03 18.5
2 Adelie Biscoe 39.0 2.48 18.4
3 Adelie Dream 38.5 2.48 18.2
4 Gentoo Biscoe 47.6 3.11 15.0
5 Chinstrap Dream 48.8 3.34 18.4
# ℹ abbreviated name: ¹standard_deviation_bill_length
# ℹ 5 more variables: standard_deviation_bill_depth <dbl>,
# avg_flipper_length <dbl>, standard_deviation_flipper_length <dbl>,
# avg_body_mass_g <dbl>, standard_deviation_body_mass_g <dbl>
Instead of listing the columns for which the means and standard deviations are to be computed, we can use tidyselect helpers to locate them. The helper used here is where(). It works in combination with another function, is.numeric(), which tests each column for the desired property.
Code
# A tibble: 5 × 7
species island avg_bill_length avg_bill_depth avg_flipper_length
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.0 18.5 192.
2 Adelie Biscoe 39.0 18.4 189.
3 Adelie Dream 38.5 18.2 190.
4 Gentoo Biscoe 47.6 15.0 217.
5 Chinstrap Dream 48.8 18.4 196.
# ℹ 2 more variables: avg_body_mass_g <dbl>, avg_year <dbl>
Notice this omnibus statement includes the column year, which is in effect categorical. To exclude this kind of variable from the omnibus statement we can add a - (minus) sign, wrapping the whole selection in the combine function c(). The expression c(where(is.numeric), -year) combines the tidyselect helper with the exclusion of the column year.
Code
# A tibble: 5 × 6
species island avg_bill_length avg_bill_depth avg_flipper_length
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.0 18.5 192.
2 Adelie Biscoe 39.0 18.4 189.
3 Adelie Dream 38.5 18.2 190.
4 Gentoo Biscoe 47.6 15.0 217.
5 Chinstrap Dream 48.8 18.4 196.
# ℹ 1 more variable: avg_body_mass_g <dbl>
Lesson 14
Another custom function for the across() function, this time using the original penguins dataset. Applying the plain mean() to it returns missing values.
To avoid ending up with NAs, we can call mean() with the na.rm option. The way to do so is to specify a new function that sets na.rm = TRUE, for example via an anonymous function(x).
Two options:
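The two options sketched below are the classic anonymous function and R's shorthand lambda syntax, which are equivalent here (the column selection and grouping mirror the earlier examples):

```r
library(dplyr)

# Option 1: classic anonymous function
palmerpenguins::penguins |>
  summarise(
    across(bill_length_mm:body_mass_g,
           function(x) mean(x, na.rm = TRUE),
           .names = "mean_{.col}"),
    .by = c(species, island)
  )

# Option 2: shorthand lambda (R >= 4.1)
palmerpenguins::penguins |>
  summarise(
    across(bill_length_mm:body_mass_g,
           \(x) mean(x, na.rm = TRUE),
           .names = "mean_{.col}"),
    .by = c(species, island)
  )
```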
Code
# A tibble: 5 × 6
species island mean_bill_length mean_bill_depth mean_flipper_length
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.0 18.4 191.
2 Adelie Biscoe 39.0 18.4 189.
3 Adelie Dream 38.5 18.3 190.
4 Gentoo Biscoe 47.5 15.0 217.
5 Chinstrap Dream 48.8 18.4 196.
# ℹ 1 more variable: mean_body_mass_g <dbl>
Code
# A tibble: 5 × 6
species island mean_bill_length mean_bill_depth mean_flipper_length
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.0 18.4 191.
2 Adelie Biscoe 39.0 18.4 189.
3 Adelie Dream 38.5 18.3 190.
4 Gentoo Biscoe 47.5 15.0 217.
5 Chinstrap Dream 48.8 18.4 196.
# ℹ 1 more variable: mean_body_mass_g <dbl>
Code
# A tibble: 5 × 6
species island avg_bill_length avg_bill_depth avg_flipper_length
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.0 18.4 191.
2 Adelie Biscoe 39.0 18.4 189.
3 Adelie Dream 38.5 18.3 190.
4 Gentoo Biscoe 47.5 15.0 217.
5 Chinstrap Dream 48.8 18.4 196.
# ℹ 1 more variable: avg_body_mass_g <dbl>
Lesson 15
This lesson shows the most important tidy select helpers.
Selecting the right data by column/variable type. Using select() along with where(is.numeric) screens the dataset for numeric variables. We can eliminate variables as well; in the example below, we exclude the column year.
The where() function can be used in several scenarios: selecting numeric variables, selecting factor variables, or selecting both factor and character variables.
Using the where() function to select variables by type (numeric, factor, character)
In this example we only want to include numeric variables, and we exclude year: although coded as numeric, this variable is in the end categorical.
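A sketch of the two select() calls behind the outputs below:

```r
library(dplyr)

penguins_wo_NAs <- tidyr::drop_na(palmerpenguins::penguins)

# Numeric columns only, excluding year
penguins_wo_NAs |>
  select(where(is.numeric), -year)

# Factor columns only
penguins_wo_NAs |>
  select(where(is.factor))
```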
Code
# A tibble: 333 × 4
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<dbl> <dbl> <int> <int>
1 39.1 18.7 181 3750
2 39.5 17.4 186 3800
3 40.3 18 195 3250
4 36.7 19.3 193 3450
5 39.3 20.6 190 3650
6 38.9 17.8 181 3625
7 39.2 19.6 195 4675
8 41.1 17.6 182 3200
9 38.6 21.2 191 3800
10 34.6 21.1 198 4400
# ℹ 323 more rows
# A tibble: 333 × 3
species island sex
<fct> <fct> <fct>
1 Adelie Torgersen male
2 Adelie Torgersen female
3 Adelie Torgersen female
4 Adelie Torgersen female
5 Adelie Torgersen male
6 Adelie Torgersen female
7 Adelie Torgersen male
8 Adelie Torgersen female
9 Adelie Torgersen male
10 Adelie Torgersen male
# ℹ 323 more rows
Code
# A tibble: 333 × 3
species island sex
<fct> <fct> <fct>
1 Adelie Torgersen male
2 Adelie Torgersen female
3 Adelie Torgersen female
4 Adelie Torgersen female
5 Adelie Torgersen male
6 Adelie Torgersen female
7 Adelie Torgersen male
8 Adelie Torgersen female
9 Adelie Torgersen male
10 Adelie Torgersen male
# ℹ 323 more rows
Using anonymous functions inside where()
Alternatively, you can define a custom function: inside where() you can use an anonymous function, combining the two tests with the OR operator ||.
Consequently the defined function returns TRUE if the corresponding column/variable is a factor or a character, and FALSE if it is numeric.
Code
penguins_wo_NAs |>
select(where(function(x) is.factor(x) || is.character(x)) )# A tibble: 333 × 3
species island sex
<fct> <fct> <fct>
1 Adelie Torgersen male
2 Adelie Torgersen female
3 Adelie Torgersen female
4 Adelie Torgersen female
5 Adelie Torgersen male
6 Adelie Torgersen female
7 Adelie Torgersen male
8 Adelie Torgersen female
9 Adelie Torgersen male
10 Adelie Torgersen male
# ℹ 323 more rows
Anonymous functions and if conditions for selecting variables
We define a custom function to identify numeric columns, and we can add another condition, such as requiring the mean to be above a particular value.
So here we define an anonymous function that checks whether a variable is numeric and whether its mean is above 200.
The difference between the double && and the single & operator: && short-circuits and returns a single logical value, so mean(x) is never evaluated for non-numeric columns. With the single &, mean() runs even on factor columns, generating the warning "argument is not numeric or logical".
Code
penguins_wo_NAs |>
select(where(function(x) is.numeric(x) && mean(x) > 200 ) )# A tibble: 333 × 3
flipper_length_mm body_mass_g year
<int> <int> <int>
1 181 3750 2007
2 186 3800 2007
3 195 3250 2007
4 193 3450 2007
5 190 3650 2007
6 181 3625 2007
7 195 4675 2007
8 182 3200 2007
9 191 3800 2007
10 198 4400 2007
# ℹ 323 more rows
Anonymous function to identify NAs
Another handy option is to select columns where a specific percentage of values is missing — in this case, asking for variables with more than 1% missing cases.
It is usually practical to name this function for future use.
Creating function to identify variables with missing cases
Sometimes it is better to save a function and give it a name for future use. The name should say what the function does. In this case, we want to save a function that identifies the columns or variables having more than 1% missing cases. We label it more_than_one_percent_missing.
Code
# A tibble: 344 × 1
sex
<fct>
1 male
2 female
3 female
4 <NA>
5 female
6 male
7 female
8 male
9 <NA>
10 <NA>
# ℹ 334 more rows
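The function described above might be defined along these lines. This is a sketch: the name more_than_one_percent_missing comes from the text, but the implementation is an assumption.

```r
library(dplyr)
library(palmerpenguins)

# Assumption: TRUE when more than 1% of a column's values are missing
more_than_one_percent_missing <- function(x) {
  mean(is.na(x)) > 0.01
}

penguins |>
  select(where(more_than_one_percent_missing))
```

With penguins, only sex crosses the 1% threshold (11 of 344 cases missing), matching the output above.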
Lesson 16
This lesson further examines tidyselect helpers for choosing columns based on their names or on strings embedded in the column names.
Select the right data by column name
- Selecting consecutive columns, between the bill_length_mm and sex columns.
- Selecting columns by specific variable labels: selecting variables ending with _mm.
- Selecting columns that start with a particular string: selecting variables starting with the string bill.
- Selecting columns that contain a particular string: selecting variables containing the string length.
Selecting consecutive columns
Selecting columns from bill_length_mm to sex.
Code
#Selecting consecutive columns
penguins_wo_NAs |>
  select(bill_length_mm:sex)
# A tibble: 333 × 5
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
<dbl> <dbl> <int> <int> <fct>
1 39.1 18.7 181 3750 male
2 39.5 17.4 186 3800 female
3 40.3 18 195 3250 female
4 36.7 19.3 193 3450 female
5 39.3 20.6 190 3650 male
6 38.9 17.8 181 3625 female
7 39.2 19.6 195 4675 male
8 41.1 17.6 182 3200 female
9 38.6 21.2 191 3800 male
10 34.6 21.1 198 4400 male
# ℹ 323 more rows
Selecting columns ending with a particular string
# A tibble: 333 × 3
bill_length_mm bill_depth_mm flipper_length_mm
<dbl> <dbl> <int>
1 39.1 18.7 181
2 39.5 17.4 186
3 40.3 18 195
4 36.7 19.3 193
5 39.3 20.6 190
6 38.9 17.8 181
7 39.2 19.6 195
8 41.1 17.6 182
9 38.6 21.2 191
10 34.6 21.1 198
# ℹ 323 more rows
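The output above, with the three *_mm columns, can presumably be reproduced with the ends_with() helper; a sketch:

```r
# Selecting columns whose names end with '_mm'
penguins_wo_NAs |>
  select(ends_with('_mm'))
```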
Selecting columns that start with a particular string
Selecting variables starting with the string bill.
Code
#Selecting columns starting with bill
penguins_wo_NAs |>
  select(starts_with('bill'))
# A tibble: 333 × 2
bill_length_mm bill_depth_mm
<dbl> <dbl>
1 39.1 18.7
2 39.5 17.4
3 40.3 18
4 36.7 19.3
5 39.3 20.6
6 38.9 17.8
7 39.2 19.6
8 41.1 17.6
9 38.6 21.2
10 34.6 21.1
# ℹ 323 more rows
Selecting columns that contain a particular string
Extending the scope of the column search
Using regular expressions to extend the scope of the search for variables. For instance, searching for variables that start with the letter f or b, are followed by any number of characters (.*), and end ($) with the string _mm. The corresponding regular expression is ^[fb].*(_mm)$
# A tibble: 333 × 3
bill_length_mm bill_depth_mm flipper_length_mm
<dbl> <dbl> <int>
1 39.1 18.7 181
2 39.5 17.4 186
3 40.3 18 195
4 36.7 19.3 193
5 39.3 20.6 190
6 38.9 17.8 181
7 39.2 19.6 195
8 41.1 17.6 182
9 38.6 21.2 191
10 34.6 21.1 198
# ℹ 323 more rows
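One way to apply that regular expression is the matches() helper; a sketch:

```r
# Columns starting with f or b and ending with _mm
penguins_wo_NAs |>
  select(matches('^[fb].*(_mm)$'))
```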
There are two more ways to select columns/variables based on a character vector that contains their names: the any_of() and all_of() helpers.
Selecting with all_of()
Code
# A tibble: 344 × 2
bill_length_mm bill_depth_mm
<dbl> <dbl>
1 39.1 18.7
2 39.5 17.4
3 40.3 18
4 NA NA
5 36.7 19.3
6 39.3 20.6
7 38.9 17.8
8 39.2 19.6
9 34.1 18.1
10 42 20.2
# ℹ 334 more rows
Selecting with any_of()
The any_of() tidy helper is more forgiving than all_of(): names in the vector that are not present in the data are silently ignored, whereas all_of() throws an error. This is handy when your list includes many columns.
Code
# A tibble: 333 × 2
bill_length_mm bill_depth_mm
<dbl> <dbl>
1 39.1 18.7
2 39.5 17.4
3 40.3 18
4 36.7 19.3
5 39.3 20.6
6 38.9 17.8
7 39.2 19.6
8 41.1 17.6
9 38.6 21.2
10 34.6 21.1
# ℹ 323 more rows
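A sketch of both helpers, using a hypothetical vector of column names:

```r
# Hypothetical vector of columns to keep
bill_vars <- c('bill_length_mm', 'bill_depth_mm')

# all_of() errors if any name in the vector is absent from the data
penguins |>
  select(all_of(bill_vars))

# any_of() silently skips names that are absent
penguins_wo_NAs |>
  select(any_of(bill_vars))
```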
Lesson 17
Bringing it all together.
Using the Ames housing dataset (ames) from the modeldata package.
Code
| Name | clean_names(ames) |
| Number of rows | 2930 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| factor | 40 |
| numeric | 34 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| ms_sub_class | 0 | 1 | FALSE | 16 | One: 1079, Two: 575, One: 287, One: 192 |
| ms_zoning | 0 | 1 | FALSE | 7 | Res: 2273, Res: 462, Flo: 139, Res: 27 |
| street | 0 | 1 | FALSE | 2 | Pav: 2918, Grv: 12 |
| alley | 0 | 1 | FALSE | 3 | No_: 2732, Gra: 120, Pav: 78 |
| lot_shape | 0 | 1 | FALSE | 4 | Reg: 1859, Sli: 979, Mod: 76, Irr: 16 |
| land_contour | 0 | 1 | FALSE | 4 | Lvl: 2633, HLS: 120, Bnk: 117, Low: 60 |
| utilities | 0 | 1 | FALSE | 3 | All: 2927, NoS: 2, NoS: 1 |
| lot_config | 0 | 1 | FALSE | 5 | Ins: 2140, Cor: 511, Cul: 180, FR2: 85 |
| land_slope | 0 | 1 | FALSE | 3 | Gtl: 2789, Mod: 125, Sev: 16 |
| neighborhood | 0 | 1 | FALSE | 28 | Nor: 443, Col: 267, Old: 239, Edw: 194 |
| condition_1 | 0 | 1 | FALSE | 9 | Nor: 2522, Fee: 164, Art: 92, RRA: 50 |
| condition_2 | 0 | 1 | FALSE | 8 | Nor: 2900, Fee: 13, Art: 5, Pos: 4 |
| bldg_type | 0 | 1 | FALSE | 5 | One: 2425, Twn: 233, Dup: 109, Twn: 101 |
| house_style | 0 | 1 | FALSE | 8 | One: 1481, Two: 873, One: 314, SLv: 128 |
| overall_cond | 0 | 1 | FALSE | 9 | Ave: 1654, Abo: 533, Goo: 390, Ver: 144 |
| roof_style | 0 | 1 | FALSE | 6 | Gab: 2321, Hip: 551, Gam: 22, Fla: 20 |
| roof_matl | 0 | 1 | FALSE | 8 | Com: 2887, Tar: 23, WdS: 9, WdS: 7 |
| exterior_1st | 0 | 1 | FALSE | 16 | Vin: 1026, Met: 450, HdB: 442, Wd : 420 |
| exterior_2nd | 0 | 1 | FALSE | 17 | Vin: 1015, Met: 447, HdB: 406, Wd : 397 |
| mas_vnr_type | 0 | 1 | FALSE | 5 | Non: 1775, Brk: 880, Sto: 249, Brk: 25 |
| exter_cond | 0 | 1 | FALSE | 5 | Typ: 2549, Goo: 299, Fai: 67, Exc: 12 |
| foundation | 0 | 1 | FALSE | 6 | PCo: 1310, CBl: 1244, Brk: 311, Sla: 49 |
| bsmt_cond | 0 | 1 | FALSE | 6 | Typ: 2616, Goo: 122, Fai: 104, No_: 80 |
| bsmt_exposure | 0 | 1 | FALSE | 5 | No: 1906, Av: 418, Gd: 284, Mn: 239 |
| bsmt_fin_type_1 | 0 | 1 | FALSE | 7 | GLQ: 859, Unf: 851, ALQ: 429, Rec: 288 |
| bsmt_fin_type_2 | 0 | 1 | FALSE | 7 | Unf: 2499, Rec: 106, LwQ: 89, No_: 81 |
| heating | 0 | 1 | FALSE | 6 | Gas: 2885, Gas: 27, Gra: 9, Wal: 6 |
| heating_qc | 0 | 1 | FALSE | 5 | Exc: 1495, Typ: 864, Goo: 476, Fai: 92 |
| central_air | 0 | 1 | FALSE | 2 | Y: 2734, N: 196 |
| electrical | 0 | 1 | FALSE | 6 | SBr: 2682, Fus: 188, Fus: 50, Fus: 8 |
| functional | 0 | 1 | FALSE | 8 | Typ: 2728, Min: 70, Min: 65, Mod: 35 |
| garage_type | 0 | 1 | FALSE | 7 | Att: 1731, Det: 782, Bui: 186, No_: 157 |
| garage_finish | 0 | 1 | FALSE | 4 | Unf: 1231, RFn: 812, Fin: 728, No_: 159 |
| garage_cond | 0 | 1 | FALSE | 6 | Typ: 2665, No_: 159, Fai: 74, Goo: 15 |
| paved_drive | 0 | 1 | FALSE | 3 | Pav: 2652, Dir: 216, Par: 62 |
| pool_qc | 0 | 1 | FALSE | 5 | No_: 2917, Exc: 4, Goo: 4, Typ: 3 |
| fence | 0 | 1 | FALSE | 5 | No_: 2358, Min: 330, Goo: 118, Goo: 112 |
| misc_feature | 0 | 1 | FALSE | 6 | Non: 2824, She: 95, Gar: 5, Oth: 4 |
| sale_type | 0 | 1 | FALSE | 10 | WD : 2536, New: 239, COD: 87, Con: 26 |
| sale_condition | 0 | 1 | FALSE | 6 | Nor: 2413, Par: 245, Abn: 190, Fam: 46 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| lot_frontage | 0 | 1 | 57.65 | 33.50 | 0.00 | 43.00 | 63.00 | 78.00 | 313.00 | ▇▇▁▁▁ |
| lot_area | 0 | 1 | 10147.92 | 7880.02 | 1300.00 | 7440.25 | 9436.50 | 11555.25 | 215245.00 | ▇▁▁▁▁ |
| year_built | 0 | 1 | 1971.36 | 30.25 | 1872.00 | 1954.00 | 1973.00 | 2001.00 | 2010.00 | ▁▂▃▆▇ |
| year_remod_add | 0 | 1 | 1984.27 | 20.86 | 1950.00 | 1965.00 | 1993.00 | 2004.00 | 2010.00 | ▅▂▂▃▇ |
| mas_vnr_area | 0 | 1 | 101.10 | 178.63 | 0.00 | 0.00 | 0.00 | 162.75 | 1600.00 | ▇▁▁▁▁ |
| bsmt_fin_sf_1 | 0 | 1 | 4.18 | 2.23 | 0.00 | 3.00 | 3.00 | 7.00 | 7.00 | ▃▂▇▁▇ |
| bsmt_fin_sf_2 | 0 | 1 | 49.71 | 169.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1526.00 | ▇▁▁▁▁ |
| bsmt_unf_sf | 0 | 1 | 559.07 | 439.54 | 0.00 | 219.00 | 465.50 | 801.75 | 2336.00 | ▇▅▂▁▁ |
| total_bsmt_sf | 0 | 1 | 1051.26 | 440.97 | 0.00 | 793.00 | 990.00 | 1301.50 | 6110.00 | ▇▃▁▁▁ |
| first_flr_sf | 0 | 1 | 1159.56 | 391.89 | 334.00 | 876.25 | 1084.00 | 1384.00 | 5095.00 | ▇▃▁▁▁ |
| second_flr_sf | 0 | 1 | 335.46 | 428.40 | 0.00 | 0.00 | 0.00 | 703.75 | 2065.00 | ▇▃▂▁▁ |
| gr_liv_area | 0 | 1 | 1499.69 | 505.51 | 334.00 | 1126.00 | 1442.00 | 1742.75 | 5642.00 | ▇▇▁▁▁ |
| bsmt_full_bath | 0 | 1 | 0.43 | 0.52 | 0.00 | 0.00 | 0.00 | 1.00 | 3.00 | ▇▆▁▁▁ |
| bsmt_half_bath | 0 | 1 | 0.06 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | ▇▁▁▁▁ |
| full_bath | 0 | 1 | 1.57 | 0.55 | 0.00 | 1.00 | 2.00 | 2.00 | 4.00 | ▁▇▇▁▁ |
| half_bath | 0 | 1 | 0.38 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 2.00 | ▇▁▅▁▁ |
| bedroom_abv_gr | 0 | 1 | 2.85 | 0.83 | 0.00 | 2.00 | 3.00 | 3.00 | 8.00 | ▁▇▂▁▁ |
| kitchen_abv_gr | 0 | 1 | 1.04 | 0.21 | 0.00 | 1.00 | 1.00 | 1.00 | 3.00 | ▁▇▁▁▁ |
| tot_rms_abv_grd | 0 | 1 | 6.44 | 1.57 | 2.00 | 5.00 | 6.00 | 7.00 | 15.00 | ▁▇▂▁▁ |
| fireplaces | 0 | 1 | 0.60 | 0.65 | 0.00 | 0.00 | 1.00 | 1.00 | 4.00 | ▇▇▁▁▁ |
| garage_cars | 0 | 1 | 1.77 | 0.76 | 0.00 | 1.00 | 2.00 | 2.00 | 5.00 | ▅▇▂▁▁ |
| garage_area | 0 | 1 | 472.66 | 215.19 | 0.00 | 320.00 | 480.00 | 576.00 | 1488.00 | ▃▇▃▁▁ |
| wood_deck_sf | 0 | 1 | 93.75 | 126.36 | 0.00 | 0.00 | 0.00 | 168.00 | 1424.00 | ▇▁▁▁▁ |
| open_porch_sf | 0 | 1 | 47.53 | 67.48 | 0.00 | 0.00 | 27.00 | 70.00 | 742.00 | ▇▁▁▁▁ |
| enclosed_porch | 0 | 1 | 23.01 | 64.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1012.00 | ▇▁▁▁▁ |
| three_season_porch | 0 | 1 | 2.59 | 25.14 | 0.00 | 0.00 | 0.00 | 0.00 | 508.00 | ▇▁▁▁▁ |
| screen_porch | 0 | 1 | 16.00 | 56.09 | 0.00 | 0.00 | 0.00 | 0.00 | 576.00 | ▇▁▁▁▁ |
| pool_area | 0 | 1 | 2.24 | 35.60 | 0.00 | 0.00 | 0.00 | 0.00 | 800.00 | ▇▁▁▁▁ |
| misc_val | 0 | 1 | 50.64 | 566.34 | 0.00 | 0.00 | 0.00 | 0.00 | 17000.00 | ▇▁▁▁▁ |
| mo_sold | 0 | 1 | 6.22 | 2.71 | 1.00 | 4.00 | 6.00 | 8.00 | 12.00 | ▅▆▇▃▃ |
| year_sold | 0 | 1 | 2007.79 | 1.32 | 2006.00 | 2007.00 | 2008.00 | 2009.00 | 2010.00 | ▇▇▇▇▃ |
| sale_price | 0 | 1 | 180796.06 | 79886.69 | 12789.00 | 129500.00 | 160000.00 | 213500.00 | 755000.00 | ▇▇▁▁▁ |
| longitude | 0 | 1 | -93.64 | 0.03 | -93.69 | -93.66 | -93.64 | -93.62 | -93.58 | ▅▅▇▆▁ |
| latitude | 0 | 1 | 42.03 | 0.02 | 41.99 | 42.02 | 42.03 | 42.05 | 42.06 | ▂▂▇▇▇ |
Code
#Examining a couple of numeric variables using gt_plt_summary
ames |>
clean_names() |>
select(sale_price,garage_cars) |>
  gt_plt_summary()
| select(clean_names(ames), sale_price, garage_cars) | ||||||
|---|---|---|---|---|---|---|
| 2930 rows x 2 cols | ||||||
| Column | Plot Overview | Missing | Mean | Median | SD | |
| sale_price | 0.0% | 180,796.1 | 160,000.0 | 79,886.7 | ||
| garage_cars | 0.0% | 1.8 | 2.0 | 0.8 | ||
Transforming the summary into a tibble
Focusing our attention on numeric variables. We can also transform the summary into a tibble while accessing the ames data. We select those ames variables that are numeric.
Code
# A tibble: 34 × 12
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd
<chr> <chr> <int> <dbl> <dbl> <dbl>
1 numeric lot_frontage 0 1 57.6 33.5
2 numeric lot_area 0 1 10148. 7880.
3 numeric year_built 0 1 1971. 30.2
4 numeric year_remod_add 0 1 1984. 20.9
5 numeric mas_vnr_area 0 1 101. 179.
6 numeric bsmt_fin_sf_1 0 1 4.18 2.23
7 numeric bsmt_fin_sf_2 0 1 49.7 169.
8 numeric bsmt_unf_sf 0 1 559. 440.
9 numeric total_bsmt_sf 0 1 1051. 441.
10 numeric first_flr_sf 0 1 1160. 392.
# ℹ 24 more rows
# ℹ 6 more variables: numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>,
# numeric.p75 <dbl>, numeric.p100 <dbl>, numeric.hist <chr>
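One way to obtain the 34 × 12 tibble above is to restrict ames to its numeric columns before skimming and then coerce the result; a sketch:

```r
library(dplyr)
library(janitor)   # clean_names()
library(skimr)     # skim()
library(modeldata) # ames

ames |>
  clean_names() |>
  select(where(is.numeric)) |>
  skim() |>
  as_tibble()
```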
Conducting summary statistics by focusing on variables that are counts. To do so, we define a custom function that checks whether a variable is a count. Notice the function screens whether the values of the variable all fall between 0 and 10, using the all() function; the anonymous function returns a single TRUE or FALSE per column. Notice the tidy helper all_of() is not applied here — just base R's all().
Next we can use this information to identify continuous variables. The anonymous function applied to ames_numeric identified 9 count variables. To identify continuous variables instead, negate the condition: !all(x %in% 0:10). This revised anonymous function yields 25 continuous variables.
Code
# A tibble: 2,930 × 9
bsmt_fin_sf_1 bsmt_full_bath bsmt_half_bath full_bath half_bath
<dbl> <dbl> <dbl> <int> <int>
1 2 1 0 1 0
2 6 0 0 1 0
3 1 0 0 1 1
4 1 1 0 2 1
5 3 0 0 2 1
6 3 0 0 2 1
7 3 1 0 2 0
8 1 0 0 2 0
9 3 1 0 2 0
10 7 0 0 2 1
# ℹ 2,920 more rows
# ℹ 4 more variables: bedroom_abv_gr <int>, kitchen_abv_gr <int>,
# fireplaces <int>, garage_cars <dbl>
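A sketch of the count-variable selection described above, assuming ames_numeric holds the numeric columns of ames:

```r
# Count variables: numeric columns whose values all fall between 0 and 10
ames_numeric |>
  select(where(function(x) is.numeric(x) && all(x %in% 0:10)))

# Continuous variables: negate the condition
ames_numeric |>
  select(where(function(x) is.numeric(x) && !all(x %in% 0:10)))
```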
Using the data subset of continuous variables, let's first focus on columns that contain the word area. Next, we identify variables containing the string sf.
# A tibble: 2,930 × 5
lot_area mas_vnr_area gr_liv_area garage_area pool_area
<int> <dbl> <int> <dbl> <int>
1 31770 112 1656 528 0
2 11622 0 896 730 0
3 14267 108 1329 312 0
4 11160 0 2110 522 0
5 13830 0 1629 482 0
6 9978 20 1604 470 0
7 4920 0 1338 582 0
8 5005 0 1280 506 0
9 5389 0 1616 608 0
10 7500 0 1804 442 0
# ℹ 2,920 more rows
# A tibble: 2,930 × 7
bsmt_fin_sf_2 bsmt_unf_sf total_bsmt_sf first_flr_sf second_flr_sf
<dbl> <dbl> <dbl> <int> <int>
1 0 441 1080 1656 0
2 144 270 882 896 0
3 0 406 1329 1329 0
4 0 1045 2110 2110 0
5 0 137 928 928 701
6 0 324 926 926 678
7 0 722 1338 1338 0
8 0 1017 1280 1280 0
9 0 415 1595 1616 0
10 0 994 994 1028 776
# ℹ 2,920 more rows
# ℹ 2 more variables: wood_deck_sf <int>, open_porch_sf <int>
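The two outputs above can presumably be produced with the contains() helper, assuming a data frame ames_continuous holding the 25 continuous variables (the name is an assumption):

```r
# Columns whose names contain 'area'
ames_continuous |>
  select(contains('area'))

# Columns whose names contain 'sf'
ames_continuous |>
  select(contains('sf'))
```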
Eyeballing the output, it appears that the gross living area variable (gr_liv_area) is the sum of the first-floor and second-floor square footage. If this hypothesis were true, excluding observations where the sum holds would leave no rows at all.
The hypothesis holds for most rows, but fails for 40 cases.
# A tibble: 40 × 25
lot_frontage lot_area year_built year_remod_add mas_vnr_area bsmt_fin_sf_2
<dbl> <int> <int> <int> <dbl> <dbl>
1 50 5500 1929 2001 0 692
2 50 9350 1947 1979 0 0
3 60 10410 1916 1987 0 0
4 60 10410 1915 1950 0 0
5 60 12144 1949 1950 0 0
6 60 8064 1948 1994 0 0
7 98 8820 1890 1996 0 0
8 60 5400 1920 1950 0 0
9 92 5520 1912 1950 0 0
10 52 6240 1925 1950 0 0
# ℹ 30 more rows
# ℹ 19 more variables: bsmt_unf_sf <dbl>, total_bsmt_sf <dbl>,
# first_flr_sf <int>, second_flr_sf <int>, gr_liv_area <int>,
# tot_rms_abv_grd <int>, garage_area <dbl>, wood_deck_sf <int>,
# open_porch_sf <int>, enclosed_porch <int>, three_season_porch <int>,
# screen_porch <int>, pool_area <int>, misc_val <int>, mo_sold <int>,
# year_sold <int>, sale_price <int>, longitude <dbl>, latitude <dbl>
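A sketch of the check, again assuming ames_continuous holds the continuous variables:

```r
# Keep only rows where gr_liv_area is NOT the sum of the two floors;
# 40 such cases remain
ames_continuous |>
  filter(gr_liv_area != first_flr_sf + second_flr_sf)
```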
Sale price and calculating averages
Focusing on sale price.
Code
ames |>
  skim(sale_price)
| Name | ames |
| Number of rows | 2930 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| sale_price | 0 | 1 | 180796.1 | 79886.69 | 12789 | 129500 | 160000 | 213500 | 755000 | ▇▇▁▁▁ |
Code
#* Checking sale price distribution as well. Notice median < mean,
#* suggesting the distribution is positively skewed.
ames |>
select(sale_price) |>
  gt_plt_summary()
| select(ames, sale_price) | ||||||
|---|---|---|---|---|---|---|
| 2930 rows x 1 cols | ||||||
| Column | Plot Overview | Missing | Mean | Median | SD | |
| sale_price | 0.0% | 180,796.1 | 160,000.0 | 79,886.7 | ||
Calculating average prices
Notice, Albert introduces the skewness() function, which belongs to the e1071 package. To have the results displayed in a nice format, we can use pivot_longer().
Using across() to estimate the min, max, average, median, sd, and skewness.
To get a nice display of the summary statistics, we can pivot the dataset longer.
In pivot_longer(), we select all of the columns using cols = everything(). Notice the median is lower than the mean, signifying the distribution is positively skewed.
The end result is a summary statistics table
Code
library(tidyverse)
library(gt)
#Calculating averages using across
ames |>
summarise(
across(
.cols = sale_price,
.fns = list(
min = min,
avg = mean,
median = median,
max = max,
standard_dev = sd,
skewness = e1071::skewness),
.names = '{.fn}'
)
) |>
#Pivoting longer to create a summary stats table
pivot_longer(
cols = everything()
  )
# A tibble: 6 × 2
name value
<chr> <dbl>
1 min 12789
2 avg 180796.
3 median 160000
4 max 755000
5 standard_dev 79887.
6 skewness 1.74
Focusing on the neighborhood column
The neighborhood column has 28 levels, which makes it difficult to handle. Consequently, we can lump levels together. The function fct_lump_n() lumps all levels except for the n most frequent (or least frequent if n < 0). In this case, we keep the top 5 most frequent neighborhoods and lump together the rest, producing 6 levels; Other is a collection of many neighborhoods.
Code
# A tibble: 28 × 2
neighborhood n
<fct> <int>
1 North_Ames 443
2 College_Creek 267
3 Old_Town 239
4 Edwards 194
5 Somerset 182
6 Northridge_Heights 166
7 Gilbert 165
8 Sawyer 151
9 Northwest_Ames 131
10 Sawyer_West 125
11 Mitchell 114
12 Brookside 108
13 Crawford 103
14 Iowa_DOT_and_Rail_Road 93
15 Timberland 72
16 Northridge 71
17 Stone_Brook 51
18 South_and_West_of_Iowa_State_University 48
19 Clear_Creek 44
20 Meadow_Village 37
21 Briardale 30
22 Bloomington_Heights 28
23 Veenker 24
24 Northpark_Villa 23
25 Blueste 10
26 Greens 8
27 Green_Hills 2
28 Landmark 1
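The frequency table above can presumably be produced with count():

```r
ames |>
  clean_names() |>
  count(neighborhood, sort = TRUE) |>
  print(n = Inf)
```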
Code
ames |>
mutate(neighborhood = fct_lump_n(
neighborhood,
n = 5)
) |>
  skim(neighborhood)
| Name | mutate(ames, neighborhood… |
| Number of rows | 2930 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| neighborhood | 0 | 1 | FALSE | 6 | Oth: 1605, Nor: 443, Col: 267, Old: 239 |
Code
ames_top_five <- ames |>
mutate(neighborhood = fct_lump_n(
neighborhood,
n = 5)
)
ames_top_five |>
select(neighborhood) |>
count(neighborhood) |>
  print(n = Inf)
# A tibble: 6 × 2
neighborhood n
<fct> <int>
1 North_Ames 443
2 College_Creek 267
3 Old_Town 239
4 Edwards 194
5 Somerset 182
6 Other 1605
Visualizing the data
With this new dataset, we can create data visualizations comparing house prices across those 5 neighborhoods (plus Other). We use facet_wrap() for the neighborhoods, and scale_x_log10() to put the x-axis on a log scale.
Code
ames_top_five |>
ggplot(aes(x = sale_price)) +
geom_density(fill = 'grey80') +
facet_wrap(~ neighborhood, nrow = 6) +
scale_x_log10(labels = scales::label_dollar()) +
theme_minimal(
base_family = 'Source Sans Pro',
base_size = 16
  )
Reframing the data visualization with minimum and maximum sale price
We can make this figure more informative by adding labels for the minimum and maximum sale price. To do so, we add a geom_text() layer. The values for the layer can be extracted with the reframe() function.
The reframe() function allows one to calculate the sale price range, returning a tibble with the range of prices. We add a type column to capture the min and max labels via type = c("min", "max"), and compute those values by neighborhood using the .by = neighborhood argument.
We save the values in a data frame named ranges. In short, reframe() generates a long data file containing the min and max sale prices across neighborhoods.
Code
# A tibble: 12 × 3
neighborhood type sale_price
<fct> <chr> <int>
1 North_Ames min 68000
2 North_Ames max 345000
3 Other min 13100
4 Other max 755000
5 Somerset min 139000
6 Somerset max 468000
7 Old_Town min 12789
8 Old_Town max 475000
9 Edwards min 35000
10 Edwards max 415000
11 College_Creek min 110000
12 College_Creek max 475000
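The ranges data frame above was presumably created along these lines; a sketch:

```r
ranges <- ames_top_five |>
  reframe(
    type = c('min', 'max'),
    sale_price = range(sale_price),  # range() returns c(min, max) per group
    .by = neighborhood
  )
```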
Adding text layer
Modifying the ggplot to add a geom_text() layer with the labels associated with each of the 5 neighborhoods.
In geom_text(), we specify the data and the aes(): for y we specify -1, for x we rely on the sale_price values in ranges, and for label we refer to the variable containing the sale price. Notice we rely on y = -1 to place the labels below the curves.
Code
ames_top_five |>
ggplot(aes(x = sale_price)) +
geom_density(fill = 'grey80') +
geom_text(data = ranges,
aes(y = -1, label = sale_price)) +
facet_wrap(~ neighborhood, nrow = 6) +
scale_x_log10(labels = scales::label_dollar()) +
theme_minimal(
base_family = 'Source Sans Pro',
base_size = 16
  )
Improving the legend
Our first attempt produced labels that are too big and unformatted. We can make the labels nicer by passing sale_price to the scales::dollar() function.
Code
ames_top_five |>
ggplot(aes(x = sale_price)) +
geom_density(fill = 'grey80') +
geom_text(data = ranges,
aes(y = -1, label = sale_price |>
scales::dollar()) ) +
facet_wrap(~ neighborhood, nrow = 6) +
scale_x_log10(labels = scales::label_dollar()) +
theme_minimal(
base_family = 'Source Sans Pro',
base_size = 16
  )
To make it more obvious what those labels stand for, we can add a geom_point() layer. It adds points at the min and max positions in the chart. We can also use an open diamond instead of a solid point (shape = 5).
Code
ames_top_five |>
ggplot(aes(x = sale_price)) +
geom_density(fill = 'grey80') +
geom_point(data = ranges,
aes(y = -0.3),
shape = 5,
size = 2) +
geom_text(data = ranges,
aes(y = -1.2, label = sale_price |>
scales::dollar()), size = 2.5 ) +
facet_wrap(~ neighborhood, nrow = 6) +
scale_x_log10(labels = scales::label_dollar()) +
theme_minimal(
base_family = 'Source Sans Pro',
base_size = 16
  )
Visualizing count variables
Another data visualization deals with count variables. Earlier, we identified 9 count variables. Let's check the column names: several variables refer to kitchens and basements.
[1] "bsmt_fin_sf_1" "bsmt_full_bath" "bsmt_half_bath" "full_bath"
[5] "half_bath" "bedroom_abv_gr" "kitchen_abv_gr" "fireplaces"
[9] "garage_cars"
We can reduce the number of count variables. Excluding variables related to the basement and kitchen leaves 7 variables instead of 9. Keeping only variables whose values stay below 6 further reduces the set to 4 variables.
As shown by skim(), the values for these 4 variables range from 0 to 5.
Code
| Name | selected_counts |
| Number of rows | 2930 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| full_bath | 0 | 1 | 1.57 | 0.55 | 0 | 1 | 2 | 2 | 4 | ▁▇▇▁▁ |
| half_bath | 0 | 1 | 0.38 | 0.50 | 0 | 0 | 0 | 1 | 2 | ▇▁▅▁▁ |
| fireplaces | 0 | 1 | 0.60 | 0.65 | 0 | 0 | 1 | 1 | 4 | ▇▇▁▁▁ |
| garage_cars | 0 | 1 | 1.77 | 0.76 | 0 | 1 | 2 | 2 | 5 | ▅▇▂▁▁ |
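One selection that arrives at these four columns, assuming a data frame ames_counts holding the nine count variables (the name and the exact steps are assumptions):

```r
selected_counts <- ames_counts |>
  # drop basement- and kitchen-related count variables
  select(!contains(c('bsmt', 'kitchen'))) |>
  # keep variables whose values stay below 6 (drops bedroom_abv_gr)
  select(where(function(x) max(x) < 6))
```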
We can mutate again with across(), transforming the count variables into factors using .fns = factor. Notice that for most variables the frequencies concentrate at 0 and 1. Consequently, we can lump the higher counts together.
Code
selected_counts |>
mutate(
across(
.cols = everything(),
.fns = factor )
) |>
  skim()
| Name | mutate(selected_counts, a… |
| Number of rows | 2930 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| factor | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| full_bath | 0 | 1 | FALSE | 5 | 2: 1532, 1: 1318, 3: 64, 0: 12 |
| half_bath | 0 | 1 | FALSE | 3 | 0: 1843, 1: 1062, 2: 25 |
| fireplaces | 0 | 1 | FALSE | 5 | 0: 1422, 1: 1274, 2: 221, 3: 12 |
| garage_cars | 0 | 1 | FALSE | 6 | 2: 1603, 1: 778, 3: 374, 0: 158 |
Lumping factor variables
Lumping together factor levels with a limited range. Instead of using fct_lump_n(), we can use if_else() to collapse the higher counts into a single level. In so doing, we use across() along with an anonymous or custom function that leaves values smaller than 2 as they are, but replaces values of 2 or greater with the text 2+. The function to do so is if_else().
Code
# selected_counts |>
# mutate(
# across(
# .cols = everything(),
# .fns = function(x) if_else(x < 2, x, '2+')
# ) ) |>
#   skim()
Transforming numeric values into characters
Notice the function generated an error. Using '2+' alongside the original values prompted R to report that it cannot combine integers and characters in the same vector: if_else() requires both branches to have the same type.
Consequently, we need to transform all the values into character format. We can do so by using the as.character() function inside the anonymous function. The as.character(x), '2+' pair indicates that values of 2 or greater are coded with the character '2+'.
The skim() function reveals there are three unique values. Notice that the min and max columns for character variables report string lengths (1 and 2), so '2+' is not visible there; however, skim reports 3 unique values, '2+' being one of them (0, 1, "2+").
Code
selected_counts |>
mutate(
across(
.cols = everything(),
.fns = function(x) if_else(x < 2,
as.character(x), '2+')
) ) |>
  skim()
| Name | mutate(…) |
| Number of rows | 2930 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| full_bath | 0 | 1 | 1 | 2 | 0 | 3 | 0 |
| half_bath | 0 | 1 | 1 | 2 | 0 | 3 | 0 |
| fireplaces | 0 | 1 | 1 | 2 | 0 | 3 | 0 |
| garage_cars | 0 | 1 | 1 | 2 | 0 | 3 | 0 |
Next, we need to transform the three values back into factors by adding the factor() function at the end. Notice the levels = c(0:1, '2+') argument.
Code
selected_counts |>
mutate(
across(
.cols = everything(),
.fns = function(x) if_else(x < 2,
as.character(x),
'2+'
) |>
factor(levels = c(0:1,'2+')) ) ) |>
  skim()
| Name | mutate(…) |
| Number of rows | 2930 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| factor | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| full_bath | 0 | 1 | FALSE | 3 | 2+: 1600, 1: 1318, 0: 12 |
| half_bath | 0 | 1 | FALSE | 3 | 0: 1843, 1: 1062, 2+: 25 |
| fireplaces | 0 | 1 | FALSE | 3 | 0: 1422, 1: 1274, 2+: 234 |
| garage_cars | 0 | 1 | FALSE | 3 | 2+: 1994, 1: 778, 0: 158 |
We can modify the dataset to include the house sale price, while excluding count variables whose values reach 6 or more. We also exclude the basement and kitchen variables.
Code
# A tibble: 2,930 × 5
sale_price full_bath half_bath fireplaces garage_cars
<int> <int> <int> <int> <dbl>
1 215000 1 0 2 2
2 105000 1 0 0 1
3 172000 1 1 0 1
4 244000 2 1 2 2
5 189900 2 1 1 2
6 195500 2 1 1 2
7 213500 2 0 0 2
8 191500 2 0 0 2
9 236500 2 0 1 2
10 189000 2 1 1 2
# ℹ 2,920 more rows
And then, we have to make sure our across() call includes all count variables while excluding the sale_price column: across(.cols = -sale_price). The end result is grouped_counts, which contains the count variables plus sale_price, a continuous variable.
Grouped counts using across(.cols = -sale_price)
Code
| Name | mutate(…) |
| Number of rows | 2930 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| factor | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| full_bath | 0 | 1 | FALSE | 3 | 2+: 1600, 1: 1318, 0: 12 |
| half_bath | 0 | 1 | FALSE | 3 | 0: 1843, 1: 1062, 2+: 25 |
| fireplaces | 0 | 1 | FALSE | 3 | 0: 1422, 1: 1274, 2+: 234 |
| garage_cars | 0 | 1 | FALSE | 3 | 2+: 1994, 1: 778, 0: 158 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| sale_price | 0 | 1 | 180796.1 | 79886.69 | 12789 | 129500 | 160000 | 213500 | 755000 | ▇▇▁▁▁ |
Code
# A tibble: 2,930 × 5
sale_price full_bath half_bath fireplaces garage_cars
<int> <fct> <fct> <fct> <fct>
1 215000 1 0 2+ 2+
2 105000 1 0 0 1
3 172000 1 1 0 1
4 244000 2+ 1 2+ 2+
5 189900 2+ 1 1 2+
6 195500 2+ 1 1 2+
7 213500 2+ 0 0 2+
8 191500 2+ 0 0 2+
9 236500 2+ 0 1 2+
10 189000 2+ 1 1 2+
# ℹ 2,920 more rows
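Putting the pieces together, grouped_counts might be built like this; a sketch:

```r
grouped_counts <- ames |>
  clean_names() |>
  select(sale_price, full_bath, half_bath, fireplaces, garage_cars) |>
  mutate(
    across(
      .cols = -sale_price,
      .fns = function(x) if_else(x < 2, as.character(x), '2+') |>
        factor(levels = c(0:1, '2+'))
    )
  )
```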
Displaying counts
The grouped_counts dataset needs to be transformed into long format for ggplot(). We pivot everything except the sale price.
Code
grouped_counts |>
pivot_longer(
cols = -sale_price
) |>
ggplot(aes(x = sale_price)) +
  geom_density(fill = 'grey80')
Faceting the count data set
In the pivoted-longer version of the grouped_counts dataset, the name column contains the variable names associated with baths, fireplaces, and garages. The value column displays the number of units, a factor ranging over 0, 1, and 2+.
Centering by amenities
We can center all distributions on their mean price using the scale() function, scaling the distributions by the type of amenity. Notice we get lots of warning messages: centering produces negative values, which lead to infinite values in the log-10 transformation. In other words, we should not combine scale() with the log10 transformation.
Code
grouped_counts |>
pivot_longer(
cols = -sale_price
) |>
mutate(
sale_price = scale(sale_price),
.by = name
) |>
ggplot(aes(x = sale_price)) +
geom_density(fill = 'grey80') +
scale_x_log10(label = scales::label_dollar()) +
facet_grid(
rows = vars(name),
cols = vars(value)
) +
theme_minimal(
base_family = 'Source Sans Pro',
base_size = 16
  )
Changing the log-10 transformation
Replacing the log-10 transformation by declaring the x-axis scale as continuous: scale_x_continuous()
Code
grouped_counts |>
pivot_longer(
cols = -sale_price
) |>
mutate(
sale_price = scale(sale_price),
.by = name
) |>
ggplot(aes(x = sale_price)) +
geom_density(fill = 'grey80') +
scale_x_continuous() +
facet_grid(
rows = vars(name),
cols = vars(value)
) +
theme_minimal(
base_family = 'Source Sans Pro',
base_size = 16
  )
Log-transforming the variable
However, the scaling works better if you log-transform the variable before scaling and visualizing it.
Code
grouped_counts |>
pivot_longer(
cols = -sale_price
) |>
mutate(
sale_price = scale(sale_price |>
log()),
.by = name
) |>
ggplot(aes(x = sale_price)) +
geom_density(fill = 'grey80') +
scale_x_continuous() +
facet_grid(
rows = vars(name),
cols = vars(value)
) +
theme_minimal(
base_family = 'Source Sans Pro',
base_size = 16
)